Add router replay for MoE models #2101
Merged
Phlip79 merged 24 commits into NVIDIA:main on Jan 27, 2026
Force-pushed the branch from f8fa9f4 to cbd57c1.
ISEEKYAN (Contributor) reviewed on Nov 20, 2025:
@litianjian It would be better to add a doc with a minimal demo of R2 as guidance.
Force-pushed the branch from 8e9ef4b to f6eb81d.
Force-pushed the branch from f6eb81d to bd32db8.
ISEEKYAN pushed a commit to verl-project/verl that referenced this pull request on Dec 4, 2025:
### What does this PR do?

This PR introduces draft **Router Replay** support into Verl. Inspired by recent research on **MoE Reinforcement Learning** ([2510.11370](https://arxiv.org/abs/2510.11370), [2507.18071](https://arxiv.org/abs/2507.18071)), this implementation supports **Router Replay (R2)** and **Rollout Router Replay (R3)**.

R2 records each token's routing (expert) selection during `log probability computation` and replays that expert selection during the policy update. R3 records during `model inference` and replays during RL post-training.

The initial version supports **Router Replay** with the `Megatron` backend, including support for the distributed training strategies **DP, TP, EP, ETP, PP, and Re-compute**.

The current implementation uses a patch-based approach. Once the upstream PR [NVIDIA/Megatron-LM#2101](NVIDIA/Megatron-LM#2101) is merged or provides corresponding interfaces, the patch can be removed and replaced with official API integration.

## Usage Tutorial

### Basic Configuration

To enable Router Replay, use either of the following methods.

#### Method 1: Trainer Configuration

Add the following configuration to your trainer config:

```yaml
router_replay:
  enabled: true
  mode: "R2"  # Options: "R2", "R3"
```

#### Method 2: Launch Script Configuration

Add the following parameter to your launch script:

```bash
# In your launch script, append this as a config override to the trainer command
actor_rollout_ref.actor.router_replay.mode="R2"
```

### R2 Mode Usage

1. **Enable R2 mode** in configuration
2. **Record phase**: During log probability computation, routing selections are automatically recorded
3. **Replay phase**: During policy update, recorded expert selections are replayed

### R3 Mode Usage

1. **Enable R3 mode** in configuration
2. **Record phase**: During model inference, routing decisions are captured
3. **Replay phase**: During RL post-training, recorded routing data is used
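A minimal, framework-free sketch of the R2 record/replay cycle follows. It assumes nothing about Verl's or Megatron's actual APIs. `topk_indices`, `route`, and the toy logits are hypothetical, and the logits are made up to mimic router drift between the two phases. The record phase stores each token's top-k expert ids. The replay phase reuses them in place of the router's fresh choice.

```python
from typing import List, Optional

Indices = List[List[int]]  # per token: the chosen expert ids


def topk_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k largest scores; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def route(logits: List[List[float]], k: int, replay: Optional[Indices] = None) -> Indices:
    """Hypothetical router: top-k expert ids per token, or a copy of replayed ids."""
    if replay is not None:
        return [list(row) for row in replay]
    return [topk_indices(row, k) for row in logits]


# Record phase (log-probability computation): keep the router's own selection.
logits_logprob = [[2.0, 1.0, 0.5, 0.1], [0.2, 0.3, 1.5, 1.4]]
recorded = route(logits_logprob, k=2)

# During the policy update the router weights change, so the logits drift.
logits_update = [[1.0, 2.0, 0.5, 0.1], [0.2, 1.6, 1.5, 1.4]]
fresh = route(logits_update, k=2)                      # what the router would pick now
replayed = route(logits_update, k=2, replay=recorded)  # R2: reuse the recorded choice

print(recorded)  # [[0, 1], [2, 3]]
print(fresh)     # [[1, 0], [1, 2]]
print(replayed)  # [[0, 1], [2, 3]]
```

Token 1 would switch from experts {2, 3} to {1, 2} without replay. R3 follows the same pattern, except that the recorded ids come from the inference engine during rollout rather than from the training-side log-probability pass.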
## In Progress

R2
- [ ] FSDP backend

R3
- [x] vLLM Rollout
- [ ] Sglang Rollout

---------

Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: zhangbiao.168 <zhangbiao.168@bytedance.com>
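An MoE model has one router per MoE layer. A trainer therefore has to switch all routers between the record and replay phases together. For R3 (for example, with the vLLM rollout listed above), it also has to load externally captured indices into each layer. The following is a hypothetical sketch of that coordination. `ReplayMode`, `LayerReplay`, `set_mode_all`, `export_all`, and `import_all` are invented names, not Verl or Megatron APIs.

```python
from enum import Enum
from typing import ClassVar, Dict, List, Optional

Indices = List[List[int]]  # per token: the chosen expert ids


class ReplayMode(Enum):
    OFF = "off"        # route normally, store nothing
    RECORD = "record"  # route normally and store the chosen experts
    REPLAY = "replay"  # ignore the router's choice and reuse stored experts


class LayerReplay:
    """Hypothetical per-MoE-layer replay state with a class-level registry."""

    instances: ClassVar[List["LayerReplay"]] = []

    def __init__(self) -> None:
        self.mode = ReplayMode.OFF
        self.indices: Optional[Indices] = None
        LayerReplay.instances.append(self)

    def choose(self, router_choice: Indices) -> Indices:
        """Return the experts this layer should actually use."""
        if self.mode is ReplayMode.REPLAY:
            if self.indices is None:
                raise RuntimeError("replay requested but no indices were set")
            return [list(row) for row in self.indices]
        if self.mode is ReplayMode.RECORD:
            self.indices = [list(row) for row in router_choice]
        return router_choice

    @classmethod
    def set_mode_all(cls, mode: ReplayMode) -> None:
        for layer in cls.instances:
            layer.mode = mode

    @classmethod
    def export_all(cls) -> Dict[int, Indices]:
        """Save: copy every layer's stored indices, keyed by layer position."""
        return {
            i: [list(row) for row in layer.indices]
            for i, layer in enumerate(cls.instances)
            if layer.indices is not None
        }

    @classmethod
    def import_all(cls, data: Dict[int, Indices]) -> None:
        """Set: load indices into each layer, e.g. ones captured at rollout (R3)."""
        for i, idx in data.items():
            cls.instances[i].indices = [list(row) for row in idx]


layers = [LayerReplay(), LayerReplay()]

# Record phase: each router's own choice is stored.
LayerReplay.set_mode_all(ReplayMode.RECORD)
layers[0].choose([[0, 1]])
layers[1].choose([[3, 2]])
saved = LayerReplay.export_all()

# Replay phase: the saved indices override the routers' new choices.
LayerReplay.import_all(saved)
LayerReplay.set_mode_all(ReplayMode.REPLAY)
out0 = layers[0].choose([[2, 3]])
out1 = layers[1].choose([[0, 1]])
print(saved)       # {0: [[0, 1]], 1: [[3, 2]]}
print(out0, out1)  # [[0, 1]] [[3, 2]]
```

Keying by layer position is a simplification. A real setup with pipeline parallelism or recompute would also need to track which microbatch and which pass the indices belong to.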
kvareddy approved these changes on Dec 8, 2025.
ISEEKYAN approved these changes on Dec 9, 2025.
ISEEKYAN approved these changes on Dec 12, 2025.
Contributor: /ok to test 3260df1
Contributor: @jon-barker Does this look okay now?
Member: /ok to test 02a65df
Member: /ok to test dc2d3b2
What does this PR do?
This PR introduces a "Router Replay" feature for Mixture-of-Experts (MoE) layers. This functionality provides a deterministic routing mechanism, which is essential for debugging, controlled experimentation, and reproducing model behavior.
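Here is a minimal sketch of what deterministic routing can look like: a top-k router that either picks experts from its own scores or is forced to reuse a given selection. It is pure Python for illustration and is not Megatron's implementation. `topk_routing_sketch`, its `replay_indices` parameter, and the softmax-then-top-k scoring are assumptions.

```python
import math
from typing import List, Optional, Tuple


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def topk_routing_sketch(
    logits: List[List[float]],
    k: int,
    replay_indices: Optional[List[List[int]]] = None,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Hypothetical top-k router: returns (gate probs, expert ids) per token.

    When replay_indices is given, expert ids are taken from it instead of
    from the scores, which makes the expert selection deterministic.
    """
    probs_out: List[List[float]] = []
    idx_out: List[List[int]] = []
    for t, row in enumerate(logits):
        scores = softmax(row)
        if replay_indices is not None:
            idx = list(replay_indices[t])
            if len(idx) != k:
                raise ValueError(f"token {t}: expected {k} replayed experts, got {len(idx)}")
        else:
            idx = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
        # Gate weights are read from the current scores at the chosen experts.
        probs_out.append([scores[i] for i in idx])
        idx_out.append(idx)
    return probs_out, idx_out


logits = [[0.0, math.log(3.0), 0.0]]  # softmax gives roughly [0.2, 0.6, 0.2]
probs, idx = topk_routing_sketch(logits, k=1)
print(idx, probs)      # [[1]] and approximately [[0.6]]

probs_r, idx_r = topk_routing_sketch(logits, k=1, replay_indices=[[2]])
print(idx_r, probs_r)  # [[2]] and approximately [[0.2]]
```

Under this sketch's assumptions, only the expert selection is replayed. Gate weights still come from the current scores, so in a differentiable framework the router would keep receiving gradients during replay. The sketch does not claim that Megatron's routing scores or normalizes the top-k experts in exactly this way.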
Inspired by recent approaches to stabilizing MoE models, Router Replay (R2) and Rollout Router Replay (R3), the `RouterReplay` implementation lets developers save and set the router's replay information. This gives precise control over the expert selection process and mitigates routing inconsistencies.

Implementation Details:
- `RouterReplay` is introduced in `moe_utils.py` to manage the state and data for the replay functionality.
- `TopKRouter`: … `topk_routing_with_score_function` during the routing process.
- `topk_routing_with_score_function` has been updated to handle the `router_replay` object.

Contribution process
The contribution flow is: Pre-checks, then PR Tests, then Code Review/Approval (Expert Review followed by Final Review), then Merge.

Pre-checks
… Core 0.8)

Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
(Step 1): Add PR label `Expert Review`
(Step 2): Collect the expert reviewers' reviews
… `Expert Review` label when your PR is ready for review.
… Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
… `Final Review` label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of `core-adlr` and `core-nemo` will be able to merge your PR.